Visualization is a powerful tool for data exploration. In the general case, however, the input dimension is high, which makes visualization a hard task. Dimensionality reduction is a family of methods that reduce the number of dimensions of the data, for visualization and for other purposes.
import pandas as pd
import numpy as np
import plotly.express as px
# Save Plotly figures in interactive mode in the HTML file
import plotly
plotly.offline.init_notebook_mode()
For more details about this dictionary, please see the project Sentiment analysis with Naive Bayes vs. LSTM Keras model
file=r'G:\Mon Drive\Personnel\05_Python_html_ext_code\08_AI_&_data science\Sentiment_analysis\Glove.npz'
loaded = np.load(file,allow_pickle=True)
Glove=loaded['Glove'].tolist()
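The `.npz` file above is a local path that readers may not have. As a stand-in for experimentation (the dictionary name and the 50-dimensional vectors mirror the notebook; the random values themselves are purely illustrative, not real GloVe embeddings), one could build a dictionary of the same shape:

```python
import numpy as np

rng = np.random.default_rng(0)
demo_words = ['car', 'bus', 'train', 'woman', 'man', 'child', 'france', 'italy', 'germany']
# Stand-in dictionary: word -> 50-dimensional vector, mimicking the loaded GloVe dict
Glove_demo = {w: rng.normal(size=50) for w in demo_words}
print(len(Glove_demo), Glove_demo['car'].shape)  # 9 (50,)
```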
words=['car','bus','train', 'woman','man','child','france','italy','germany']
category=['transport','transport','transport','human','human','human','country','country','country']
len(words),len(category)
(9, 9)
X=[]
for w in words:
X.append(Glove[w].tolist())
X=np.array(X)
X.shape
(9, 50)
Xn=(X-X.mean(axis=0))/X.std(axis=0)
Xn.shape
(9, 50)
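The standardization above gives each of the 50 columns zero mean and unit standard deviation, which is easy to verify. A minimal sketch on synthetic data of the same shape:

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(loc=3.0, scale=2.0, size=(9, 50))
# Column-wise standardization, as in the notebook
Xn_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
# Every column now has mean ~0 and standard deviation ~1
print(np.allclose(Xn_demo.mean(axis=0), 0), np.allclose(Xn_demo.std(axis=0), 1))  # True True
```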
*The input data has 50 columns, so it is hard to plot all of them. The solution is to use the PCA algorithm to reduce the dimension from 50 down to 2.*
COV=np.cov(Xn, rowvar=False)
COV.shape
(50, 50)
EigenVals, EigenVecs = np.linalg.eigh(COV)
EigenVecs.shape
(50, 50)
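`np.linalg.eigh` is designed for symmetric matrices such as a covariance matrix, and it returns the eigenvalues in ascending order, which is why a descending sort is applied below. A small check on a 2×2 symmetric matrix:

```python
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # symmetric matrix
vals, vecs = np.linalg.eigh(C)
print(vals)                         # [1. 3.] -- ascending order
# Each column of vecs is an eigenvector: C @ v = lambda * v
for i in range(2):
    assert np.allclose(C @ vecs[:, i], vals[i] * vecs[:, i])
```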
EigenVals
array([-2.88818316e-15, -2.53898779e-15, -2.38941499e-15, -2.27499066e-15,
-2.23097527e-15, -1.90785440e-15, -1.74317347e-15, -1.72006427e-15,
-1.21593304e-15, -1.17649979e-15, -9.99778220e-16, -8.60631073e-16,
-7.56937735e-16, -6.74129019e-16, -5.63288864e-16, -4.45314686e-16,
-4.19832605e-16, -3.93899035e-16, -3.22564427e-16, -1.87234899e-16,
3.61433213e-18, 5.69336912e-17, 1.52957220e-16, 2.90611120e-16,
4.07079914e-16, 4.32576628e-16, 4.96066786e-16, 5.53562087e-16,
7.22395715e-16, 7.33001616e-16, 8.55283910e-16, 9.73504182e-16,
1.19645835e-15, 1.32362356e-15, 1.70395447e-15, 1.74594450e-15,
2.14249559e-15, 2.22708761e-15, 2.61559632e-15, 2.89292472e-15,
3.36249024e-15, 4.82687959e-15, 1.10442638e+00, 2.05198275e+00,
3.37678465e+00, 3.63730471e+00, 5.36501108e+00, 6.01310240e+00,
1.42403286e+01, 2.04610594e+01])
# Sort the eigenValues: Descending
index=np.argsort(EigenVals)[::-1]
index
array([49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33,
32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16,
15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
dtype=int64)
# EigenValues sorting
EigenVals=EigenVals[index]
EigenVals
array([ 2.04610594e+01, 1.42403286e+01, 6.01310240e+00, 5.36501108e+00,
3.63730471e+00, 3.37678465e+00, 2.05198275e+00, 1.10442638e+00,
4.82687959e-15, 3.36249024e-15, 2.89292472e-15, 2.61559632e-15,
2.22708761e-15, 2.14249559e-15, 1.74594450e-15, 1.70395447e-15,
1.32362356e-15, 1.19645835e-15, 9.73504182e-16, 8.55283910e-16,
7.33001616e-16, 7.22395715e-16, 5.53562087e-16, 4.96066786e-16,
4.32576628e-16, 4.07079914e-16, 2.90611120e-16, 1.52957220e-16,
5.69336912e-17, 3.61433213e-18, -1.87234899e-16, -3.22564427e-16,
-3.93899035e-16, -4.19832605e-16, -4.45314686e-16, -5.63288864e-16,
-6.74129019e-16, -7.56937735e-16, -8.60631073e-16, -9.99778220e-16,
-1.17649979e-15, -1.21593304e-15, -1.72006427e-15, -1.74317347e-15,
-1.90785440e-15, -2.23097527e-15, -2.27499066e-15, -2.38941499e-15,
-2.53898779e-15, -2.88818316e-15])
# Eigenvectors sorting (same order as the eigenvalues)
EigenVecs=EigenVecs[:,index]
This matrix will not be used in the current project
S=np.diag(EigenVals)
S.shape
(50, 50)
Out_dim=2
Sub_EigenVecs=EigenVecs[:,:Out_dim]
Sub_EigenVecs.shape
(50, 2)
Reminder of the input shape
X.T.shape
(50, 9)
Xr=Sub_EigenVecs.T.dot(X.T).T
Xr.shape
(9, 2)
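The projection `Sub_EigenVecs.T.dot(X.T).T` is simply the matrix product `X @ Sub_EigenVecs`: each row of `X` is projected onto the top-2 eigenvectors. A quick check with stand-in matrices of the same shapes:

```python
import numpy as np

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(9, 50))
U = rng.normal(size=(50, 2))        # stand-in for the top-2 eigenvectors
left = U.T.dot(X_demo.T).T          # the notebook's formulation
right = X_demo.dot(U)               # equivalent row-wise projection
print(np.allclose(left, right))     # True
```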
Reminder of the input shape
X.shape
(9, 50)
df=pd.DataFrame(Xr,columns=['x1','x2'])
df['word']=words
df['category']=category
df
| x1 | x2 | word | category | |
|---|---|---|---|---|
| 0 | 0.823152 | 2.500449 | car | transport |
| 1 | 0.960812 | 3.640771 | bus | transport |
| 2 | 0.436880 | 2.558087 | train | transport |
| 3 | 2.452261 | -1.506051 | woman | human |
| 4 | 2.094862 | -1.170678 | man | human |
| 5 | 2.403375 | -1.911725 | child | human |
| 6 | -3.407146 | -1.153402 | france | country |
| 7 | -3.175158 | -0.729738 | italy | country |
| 8 | -3.332977 | -1.022581 | germany | country |
fig=px.scatter(df,x='x1',y='x2',color='category',text='word',width=800, height=600,
               title='The transformed data')
fig.show()
We can see that each category is grouped in a specific area of the figure
def PCA(X, Out_dim=2, std_norm=False):
    # X.shape: (m, n) -- m is the number of rows (samples), n the number of columns (features)
    # Out_dim: the desired output dimension; 2 is convenient for visualization
    # std_norm: if True, also divide by the standard deviation when normalizing
    # Normalization
    if std_norm:
        Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    else:
        Xn = X - X.mean(axis=0)
    # Covariance
    COV = np.cov(Xn, rowvar=False)
    # ======== Eigendecomposition of the covariance matrix ========
    # Eigenvalues and eigenvectors
    EigenVals, EigenVecs = np.linalg.eigh(COV)
    # Sort the eigenvalues in descending order
    index = np.argsort(EigenVals)[::-1]
    # Use the index to reorder the eigenvectors accordingly
    EigenVecs = EigenVecs[:, index]
    # Keep only the top Out_dim eigenvectors
    U = EigenVecs[:, :Out_dim]
    # =============================================================
    # Compute the reduced X
    Xr = U.T.dot(X.T).T
    return Xr
Xr2=PCA(X, Out_dim=2)
df2=pd.DataFrame(Xr2,columns=['x1','x2'])
df2['word']=words
df2['category']=category
df2
| x1 | x2 | word | category | |
|---|---|---|---|---|
| 0 | 1.833871 | 2.239016 | car | transport |
| 1 | 2.182165 | 3.049211 | bus | transport |
| 2 | 1.314006 | 2.543402 | train | transport |
| 3 | 1.893575 | -2.657787 | woman | human |
| 4 | 1.778322 | -2.000878 | man | human |
| 5 | 1.787517 | -2.536059 | child | human |
| 6 | -3.665965 | 0.056487 | france | country |
| 7 | -3.298916 | 0.007370 | italy | country |
| 8 | -3.535607 | 0.202911 | germany | country |
fig=px.scatter(df2,x='x1',y='x2',color='category',text='word',width=800, height=600,
               title='The transformed data with the PCA function')
fig.show()
from sklearn.decomposition import PCA as SklearnPCA
For more information, see the scikit-learn documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
pca = SklearnPCA(n_components=2)
pca.fit(X)
PCA(n_components=2)
Xr_sklearn=pca.transform(X)
dfs=pd.DataFrame(Xr_sklearn,columns=['x1','x2'])
dfs['word']=words
dfs['category']=category
dfs
| x1 | x2 | word | category | |
|---|---|---|---|---|
| 0 | -1.801764 | 2.138607 | car | transport |
| 1 | -2.150057 | 2.948803 | bus | transport |
| 2 | -1.281898 | 2.442994 | train | transport |
| 3 | -1.861468 | -2.758195 | woman | human |
| 4 | -1.746214 | -2.101286 | man | human |
| 5 | -1.755409 | -2.636467 | child | human |
| 6 | 3.698072 | -0.043921 | france | country |
| 7 | 3.331024 | -0.093039 | italy | country |
| 8 | 3.567715 | 0.102503 | germany | country |
fig=px.scatter(dfs,x='x1',y='x2',color='category',text='word',width=800, height=600,
title='Sklearn transmorde data')
fig.show()
# Flip the sign of the first component (eigenvectors are defined up to sign)
dfs.x1*=-1
fig=px.scatter(dfs,x='x1',y='x2',color='category',text='word',width=800, height=600)
fig.show()
Up to a sign flip of the first component, the local PCA function and the sklearn PCA function give the same result. Eigenvectors are only defined up to sign, and sklearn also centers the data before projecting, which shifts each component by a constant offset.
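The sign flip applied above, and the small numeric differences between the two tables, both have a simple explanation that can be demonstrated on synthetic data: an eigenvector stays an eigenvector when its sign is flipped, and centering the data before projecting only shifts each projected component by the same constant for every row:

```python
import numpy as np

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(9, 50))
C = np.cov(X_demo, rowvar=False)
vals, vecs = np.linalg.eigh(C)
v = vecs[:, -1]                         # top eigenvector
# 1) Sign ambiguity: -v is an equally valid eigenvector of C
assert np.allclose(C @ v, vals[-1] * v)
assert np.allclose(C @ (-v), vals[-1] * (-v))
# 2) Centering before projecting only shifts the component by a constant
proj_raw = X_demo @ v
proj_centered = (X_demo - X_demo.mean(axis=0)) @ v
offset = proj_raw - proj_centered
print(np.allclose(offset, offset[0]))   # True: identical offset for every row
```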
Xr3=PCA(X, Out_dim=3)
Xr3.shape
(9, 3)
df3=pd.DataFrame(Xr3,columns=['x1','x2','x3'])
df3['word']=words
df3['category']=category
df3
| x1 | x2 | x3 | word | category | |
|---|---|---|---|---|---|
| 0 | 1.833871 | 2.239016 | 1.986809 | car | transport |
| 1 | 2.182165 | 3.049211 | 0.101166 | bus | transport |
| 2 | 1.314006 | 2.543402 | -0.456373 | train | transport |
| 3 | 1.893575 | -2.657787 | 0.990900 | woman | human |
| 4 | 1.778322 | -2.000878 | 2.066705 | man | human |
| 5 | 1.787517 | -2.536059 | -1.404173 | child | human |
| 6 | -3.665965 | 0.056487 | 0.787283 | france | country |
| 7 | -3.298916 | 0.007370 | 0.628147 | italy | country |
| 8 | -3.535607 | 0.202911 | 0.335505 | germany | country |
fig=px.scatter_3d(df3, x='x1', y='x2', z='x3',color='category',
                  text='word',width=800, height=600,title='3D plotting of the transformed data')
fig.show()
In this notebook we developed a PCA algorithm for dimensionality reduction using only the NumPy library, and we compared the result with the scikit-learn PCA transformation.